Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scale frequency to suppress RCU CPU stall warning #67

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Mes0903
Copy link
Collaborator

@Mes0903 Mes0903 commented Jan 11, 2025

Since the emulator currently operates using sequential emulation, the execution time for the boot process is relatively long, which can result in the generation of RCU CPU stall warnings.

To address this issue, there are several potential solutions:

  1. Scale the frequency to slow down emulator time during the boot process, thereby eliminating RCU CPU stall warnings.
  2. During the boot process, avoid using clock_gettime to update ticks and instead manage the tick increment relationship manually.
  3. Implement multi-threaded emulation to accelerate the emulator's execution speed.

For the third point, while implementing multi-threaded emulation can significantly accelerate the emulator's execution speed, it cannot guarantee that this issue will not reappear as the number of cores increases in the future. Therefore, a better approach is to use methods 1 and 2 to allow the emulator to set an expected time for completing the boot process.

The advantages and disadvantages of the scale method are as follows:

Advantages:

  • Simple implementation
  • Effectively sets the expected boot process completion time
  • Results have strong interpretability
  • Emulator time can be easily mapped back to real time

Disadvantages:

  • Slower execution speed

The advantages and disadvantages of the increment ticks method are as follows:

Advantages:

  • Faster execution speed
  • Effectively sets the expected boot process completion time

Disadvantages:

  • More complex implementation
  • Some results are difficult to interpret
  • Emulator time is difficult to map back to real time

Based on practical tests, the second method provides limited acceleration but introduces some significant drawbacks, such as difficulty in interpreting results and the complexity of managing the increment relationship. Therefore, this commit opts for the scale frequency method to address this issue.

This commit divides time into emulator time and real time. During the boot process, the timer uses scale frequency to slow down the growth of emulator time, eliminating RCU CPU stall warnings. After the boot process is complete, the growth of emulator time aligns with real time.

To configure the scale frequency parameter, three pieces of information are required:

  1. The expected completion time of the boot process
  2. A reference point for estimating the boot process completion time
  3. The relationship between the reference point and the number of SMPs

According to the Linux kernel documentation:
https://docs.kernel.org/RCU/stallwarn.html#config-rcu-cpu-stall-timeout

The grace period for RCU CPU stalls is typically set to 21 seconds. By dividing this value by two as the expected completion time, we can provide a sufficient buffer to reduce the impact of errors and avoid RCU CPU stall warnings.

Using gprof for basic statistical analysis, it was found that semu_timer_clocksource accounts for approximately 10% of the boot process execution time. Since the logic within semu_timer_clocksource is relatively simple, its execution time can be assumed to be nearly equal to clock_gettime.

Furthermore, by adding a counter to semu_timer_clocksource, it was observed that each time the number of SMPs increases by 1, the execution count of semu_timer_clocksource increases by approximately $2 \times 10^8$ (see the table below).

With this information, we can estimate the boot process completion time as

$$ \text{SecPerCall} \times \text{SMPs} \times 2 \times 10^8 \times \frac{100%}{10%} $$

seconds, and thereby calculate the scale frequency parameter. For instance, if the estimated time is 200 seconds and the target time is 10 seconds, the scaling factor would be 10 / 200.

To calculate the proportion of semu_timer_clocksource using gprof, simply add the -pg option during compilation. Then, terminate the emulator's operation immediately after the first switch to U mode (when the boot process is complete). This approach allows for a rough estimation of the proportion occupied by semu_timer_clocksource.

Below is the gprof testing output:

  • CPU: 13th Gen Intel(R) Core(TM) i7-13700

     mes@DESKTOP-HLQ9F6A:~/semu$ gprof ./semu
     Flat profile:
     
     Each sample counts as 0.01 seconds.
       %   cumulative   self              self     total           
      time   seconds   seconds    calls  Ts/call  Ts/call  name    
      39.91     55.19    55.19                             vm_step
      10.55     69.78    14.59                             semu_timer_clocksource
       9.27     82.60    12.82                             semu_start
       6.49     91.57     8.97                             aclint_mswi_update_interrupts
       6.19    100.13     8.56                             mmu_store
       5.73    108.05     7.92                             mmu_translate
       3.98    113.56     5.51                             semu_timer_get
       3.33    118.16     4.60                             op_rv32i
       2.92    122.20     4.04                             aclint_mtimer_update_interrupts
       2.76    126.02     3.82                             aclint_sswi_update_interrupts
       2.10    128.93     2.91                             ram_read
       1.59    131.13     2.20                             mmu_load
       1.27    132.89     1.76                             mem_store
       1.19    134.54     1.65                             _init
       0.62    135.40     0.86                             mem_load
       0.46    136.04     0.64                             u8250_check_ready
       0.41    136.61     0.57                             ram_write
       0.39    137.15     0.54                             csr_read
       0.37    137.66     0.51                             virtio_net_refresh_queue
       0.28    138.05     0.39                             mem_fetch
       0.17    138.28     0.23                             op_csr_cs
    
  • CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz

     mes@Mes:~/semu$ gprof ./semu
     Flat profile:
     
     Each sample counts as 0.01 seconds.
       %   cumulative   self              self     total           
      time   seconds   seconds    calls  Ts/call  Ts/call  name    
      32.96     77.77    77.77                             vm_step
      15.96    115.43    37.66                             semu_timer_clocksource
      13.95    148.34    32.91                             semu_start
       6.10    162.74    14.40                             aclint_mswi_update_interrupts
       5.95    176.77    14.03                             mmu_store
       5.35    189.39    12.62                             mmu_translate
       3.85    198.47     9.08                             aclint_mtimer_update_interrupts
       2.73    204.91     6.44                             op_rv32i
       2.46    210.72     5.81                             ram_read
       2.11    215.70     4.98                             aclint_sswi_update_interrupts
       1.74    219.81     4.11                             mmu_load
       1.74    223.91     4.10                             semu_timer_get
       1.30    226.98     3.07                             mem_store
       0.92    229.14     2.16                             ram_write
       0.81    231.05     1.91                             _init
       0.61    232.48     1.43                             mem_load
       0.55    233.78     1.30                             csr_read
       0.35    234.60     0.82                             u8250_check_ready
       0.22    235.11     0.51                             op_csr_cs
       0.21    235.61     0.50                             virtio_net_refresh_queue
       0.14    235.95     0.34                             mem_fetch
       0.00    235.96     0.01                             semu_timer_init
    

And by adding a counter to semu_timer_clocksource, it becomes possible to calculate the relationship between SMPs and the number of times semu_timer_clocksource is called.

Below is the testing output(13th Gen Intel(R) Core(TM) i7-13700):

SMP times call semu_timer_total_ticks
1 239,937,385
2 410,377,969
3 600,190,253
4 825,078,230
5 1,007,098,511
6 1,213,304,632
7 1,419,857,500
8 1,627,353,803
9 1,835,991,887
10 2,056,837,458
11 2,269,651,319
12 2,485,299,111
13 2,703,809,852
14 2,917,880,774
15 3,134,960,385
16 3,450,041,050
17 3,587,880,492
18 3,810,315,132
19 4,052,935,937
20 4,274,307,156
21 4,503,423,577
22 4,742,850,891
23 4,973,145,039
24 5,213,953,075
25 5,597,276,263
26 5,696,829,697
27 5,935,553,504
28 6,173,571,960
29 6,620,649,951
30 6,683,572,727
31 6,942,616,048
32 7,180,650,711

@jserv jserv requested review from chiangkd and ranvd January 11, 2025 16:51
jserv

This comment was marked as resolved.

#define SEMU_BOOT_TARGET_TIME 10
#endif

bool boot_complete = false;
Copy link
Collaborator

@ranvd ranvd Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving boot_complete variable into vm_t for a more conceptually accurate design.

Copy link
Collaborator Author

@Mes0903 Mes0903 Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move boot_complete into vm_t, all existing functions for semu_timer_t would need an additional vm_t parameter. For example, semu_timer_get would change to:

semu_timer_get(vm_t *vm, semu_timer_t *timer)

This change would indirectly require the areas that call this function to also pass in a vm_tparameter. For instance, since semu_timer_get is called within aclint_mtimer_update_interrupts, the API of aclint_mtimer_update_interrupts would also need to be updated to include vm_t.

As this pattern continues, the API changes would proliferate significantly. Perhaps we could introduce a static bool pointer pointing to boot_complete and assign its value during semu_timer_init. This way, we would only need to modify the parameters of semu_timer_init.

utils.c Outdated
Comment on lines 64 to 68
struct timespec start, end;
clock_gettime(CLOCKID, &start);

for (uint64_t loops = 0; loops < target_loop; loops++)
clock_gettime(CLOCKID, &end);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not get the idea behind this code snip. What did you update the value end several times?

Copy link
Collaborator Author

@Mes0903 Mes0903 Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is meant to measure the execution time of target_loop times clock_gettime call. In my understanding, the following code can achieve the same purpose:

struct timespec start, end;
clock_gettime(CLOCKID, &start);

for (uint64_t loops = 0; loops < target_loop - 1; loops++) {
    struct timespec ts;
    clock_gettime(CLOCKID, &ts);
}

clock_gettime(CLOCKID, &end);

However, in the for loop, the variable ts is not used. Therefore, I simply replaced it with end as the parameter of clock_gettime. This way, when exiting the loop, the value of end will be the time from the last execution of clock_gettime, which can also achieve the purpose of measure the execution time of target_loop times clock_gettime call.

Then, by dividing target_loop, the execution time of a single clock_gettime call can be calculated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add comments to address the measurement for HRT syscall overhead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, you should wrap POSIX, macOS, and time utility in standard library.

utils.c Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
@Mes0903
Copy link
Collaborator Author

Mes0903 commented Jan 13, 2025

I noticed that the time during boot process was far from the target time on macOS. I think this may cause by the hypothesis of " $SMP \times 2 \times 10^8$ calls of semu_timer_clocksource" or "semu_timer_clocksource is ~10% of boot overhead", is not correct on macOS. But the logic of the function was implemented now.

utils.c Outdated
Comment on lines 103 to 106
/* Perform 'target_loop' times calling the host HRT. */
for (uint64_t loops = 0; loops < target_loop; loops++)
(void) host_time_ns();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Accurate Benchmarking to eliminate loop overhead.

utils.h Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
utils.h Outdated Show resolved Hide resolved
@Mes0903 Mes0903 force-pushed the scale-frequency branch 2 times, most recently from bdf4a62 to cb265ae Compare January 15, 2025 03:55
@jserv
Copy link
Collaborator

jserv commented Jan 15, 2025

Test under macOS + Apple M1:

$ make SMP=4 check
...
[  388.139490] rcu: INFO: rcu_sched self-detected stall on CPU
[  388.156300] rcu: 	0-....: (41291 ticks this GP) idle=bdfc/1/0x40000002 softirq=1137/1137 fqs=6808
[  388.187007] 	(t=68790 jiffies g=-327 q=5 ncpus=4)
[  388.205900] CPU: 0 PID: 37 Comm: kworker/u8:1 Not tainted 6.1.99 #1
[  388.226518] Hardware name: semu (DT)
[  388.239863] Workqueue: events_unbound async_run_entry_fn
[  388.260077] epc : do_reset+0x30/0x6c
[  388.278895]  ra : do_reset+0x38/0x6c
[  388.297349] epc : c03816d0 ra : c03816d8 sp : c09d7e50
[  388.314871]  gp : c04de840 tp : c09d8000 t0 : c0bd3e50
[  388.332163]  t1 : 00000008 t2 : 8a9164d2 s0 : c09d7e60
[  388.349003]  s1 : c03a48b8 a0 : 00000001 a1 : 00000000
[  388.365395]  a2 : 00000000 a3 : 007a5804 a4 : 0005a7fb
[  388.381809]  a5 : 00000000 a6 : 00000000 a7 : 0000000c
[  388.398328]  s2 : 004725a7 s3 : 0038da58 s4 : 00000000
[  388.415347]  s5 : 00000030 s6 : 00000007 s7 : c03a496c
[  388.432367]  s8 : c03a48b8 s9 : c0381488 s10: c03a4968
[  388.450494]  s11: c0381894 t3 : c0b9d440 t4 : 00000003
[  388.467256]  t5 : 94d7d78e t6 : c09d7e08
[  388.481446] status: 00000120 badaddr: dfea5804 cause: 80000005
[  388.500241] [<c03816d0>] do_reset+0x30/0x6c
[  388.521698] [<c0381870>] write_buffer+0x40/0x64
[  388.543520] [<c038207c>] unpack_to_rootfs+0x1c4/0x300
[  388.579518] [<c0382848>] do_populate_rootfs+0x70/0xd8
[  388.601539] [<c002bd8c>] async_run_entry_fn+0x30/0xd0
[  388.623279] [<c0021990>] process_one_work+0x198/0x224
[  388.644797] [<c0021f18>] worker_thread+0x238/0x294
[  388.667502] [<c00289c0>] kthread+0xd4/0xd8
[  388.687176] [<c0002514>] ret_from_exception+0x0/0x1c
[  416.399001] Freeing initrd memory: 8188K
...

Since the emulator currently operates using sequential emulation, the
execution time for the boot process is relatively long, which can result
in the generation of RCU CPU stall warnings.

To address this issue, there are several potential solutions:

1. Scale the frequency to slow down emulator time during the boot
   process, thereby eliminating RCU CPU stall warnings.
2. During the boot process, avoid using 'clock_gettime' to update ticks
   and instead manage the tick increment relationship manually.
3. Implement multi-threaded emulation to accelerate the emulator's
   execution speed.

For the third point, while implementing multi-threaded emulation can
significantly accelerate the emulator's execution speed, it cannot
guarantee that this issue will not reappear as the number of cores
increases in the future. Therefore, a better approach is to use methods
1 and 2 to allow the emulator to set an expected time for completing the
boot process.

The advantages and disadvantages of the scale method are as follows:

Advantages:
- Simple implementation
- Effectively sets the expected boot process completion time
- Results have strong interpretability
- Emulator time can be easily mapped back to real time

Disadvantages:
- Slower execution speed

The advantages and disadvantages of the increment ticks method are as
follows:

Advantages:
- Faster execution speed
- Effectively sets the expected boot process completion time

Disadvantages:
- More complex implementation
- Some results are difficult to interpret
- Emulator time is difficult to map back to real time

Based on practical tests, the second method provides limited
acceleration but introduces some significant drawbacks, such as
difficulty in interpreting results and the complexity of managing the
increment relationship. Therefore, this commit opts for the scale
frequency method to address this issue.

This commit divides time into emulator time and real time. During the
boot process, the timer uses scale frequency to slow down the growth of
emulator time, eliminating RCU CPU stall warnings. After the boot
process is complete, the growth of emulator time aligns with real time.

To configure the scale frequency parameter, three pieces of information
are required:

1. The expected completion time of the boot process
2. A reference point for estimating the boot process completion time
3. The relationship between the reference point and the number of SMPs

According to the Linux kernel documentation:
https://docs.kernel.org/RCU/stallwarn.html#config-rcu-cpu-stall-timeout

The grace period for RCU CPU stalls is typically set to 21 seconds. By
dividing this value by two as the expected completion time, we can
provide a sufficient buffer to reduce the impact of errors and avoid
RCU CPU stall warnings.

Using 'gprof' for basic statistical analysis, it was found that
'semu_timer_clocksource' accounts for approximately 10% of the boot
process execution time. Since the logic within 'semu_timer_clocksource'
is relatively simple, its execution time can be assumed to be nearly
equal to 'clock_gettime'.

Furthermore, by adding a counter to 'semu_timer_clocksource', it was
observed that each time the number of SMPs increases by 1, the execution
count of 'semu_timer_clocksource' increases by approximately '2 * 10^8'

With this information, we can estimate the boot process completion time
as 'sec_per_call * SMPs * 2 * 10^8 * (100% / 10%)' seconds, and thereby
calculate the scale frequency parameter. For instance, if the estimated
time is 200 seconds and the target time is 10 seconds, the scaling
factor would be '10 / 200'.
@jserv
Copy link
Collaborator

jserv commented Jan 15, 2025

Test on Ubuntu Linux + eMAG (32-core Armv8-A):

$ make SMP=32 check
...
[   35.995833] rcu: INFO: rcu_sched self-detected stall on CPU
[   36.000832] rcu: 	6-....: (5249 ticks this GP) idle=111c/1/0x40000002 softirq=1136/1136 fqs=1950
[   36.011131] 	(t=5252 jiffies g=-987 q=11 ncpus=32)
[   36.016552] CPU: 6 PID: 8 Comm: kworker/u64:0 Not tainted 6.1.99 #1
[   36.023057] Hardware name: semu (DT)
[   36.026911] Workqueue: events_unbound async_run_entry_fn
[   36.033356] epc : eat+0x20/0x54
[   36.039018]  ra : do_reset+0x38/0x6c
[   36.045041] epc : c0381544 ra : c03816d8 sp : c0847e40
[   36.050702]  gp : c04de840 tp : c087af40 t0 : c1325e50
[   36.056304]  t1 : 00000008 t2 : e893c869 s0 : c0847e50
[   36.061664]  s1 : c03a48b8 a0 : 00000001 a1 : 00000000
[   36.066844]  a2 : 00000000 a3 : 0054cb37 a4 : dfc4cb38
[   36.072265]  a5 : c03a48b8 a6 : 00000000 a7 : 0000000c
[   36.077565]  s2 : 004725a7 s3 : 0038da58 s4 : 00000000
[   36.083106]  s5 : 00000030 s6 : 00000007 s7 : c03a496c
[   36.088466]  s8 : c03a48b8 s9 : c0381488 s10: c03a4968
[   36.094128]  s11: c0381894 t3 : c130ff40 t4 : 00000003
[   36.099609]  t5 : 56e350ad t6 : c0847e08
[   36.104066] status: 00000120 badaddr: c03a48c4 cause: 80000005
[   36.110149] [<c0381544>] eat+0x20/0x54
[   36.116774] [<c03816d8>] do_reset+0x38/0x6c
[   36.123580] [<c0381870>] write_buffer+0x40/0x64
[   36.130567] [<c038207c>] unpack_to_rootfs+0x1c4/0x300
[   36.138036] [<c0382848>] do_populate_rootfs+0x70/0xd8
[   36.144179] [<c002bd8c>] async_run_entry_fn+0x30/0xd0
[   36.150865] [<c0021990>] process_one_work+0x198/0x224
[   36.157731] [<c0021f18>] worker_thread+0x238/0x294
[   36.164296] [<c00289c0>] kthread+0xd4/0xd8
[   36.169957] [<c0002514>] ret_from_exception+0x0/0x1c
[   70.530512] Freeing initrd memory: 8188K

@Mes0903
Copy link
Collaborator Author

Mes0903 commented Jan 15, 2025

Based on the output, I think it clearly that the assumption of " $SMP \times 2 \times 10^8$ calls of semu_timer_clocksource" or "semu_timer_clocksource is ~10% of boot overhead" does not hold on macOS. We need to re-evaluate the number of semu_timer_clocksource calls and its proportion on macOS to accurately calculate the scale factor.

@@ -56,6 +56,8 @@ OBJS := \
aclint.o \
$(OBJS_EXTRA)

LDFLAGS := -pg
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -pg instrumentation option might be too heavy. You should consider more lightweight ways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants